I really like pandas – the powerful data analysis framework for Python. And I really like pygal – an interactive visualization library written in and for Python.
Why not put these two libraries together for effective data visualizations?
In this blog post, I want to show you some basic use cases and integration tips for pandas and pygal.
We need some kind of data. Which dataset we use doesn't really matter. Here I have a dataset that was produced to measure the utilization of source code during program execution. It shows the lines of source code that were executed (covered) or missed during a production coverage measurement.
As usual, we load this data with pandas first.
In [1]:
import pandas as pd
raw = pd.read_csv("datasets/jacoco_production_coverage_spring_petclinic.csv")
raw.head()
Out[1]:
Let's create a nice dataframe that makes this data easier to work with later.
In [2]:
df = pd.DataFrame(index=raw.index)
df['class'] = raw['PACKAGE'] + "." + raw['CLASS']
df['lines'] = raw['LINE_MISSED'] + raw['LINE_COVERED']
df['coverage'] = raw['LINE_COVERED'] / df['lines']
df.head()
Out[2]:
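To make the derived columns a bit more concrete, here is a small sketch with made-up numbers. The raw_demo frame and its values are hypothetical; only the column names are taken from the dataset above.

```python
import pandas as pd

# Hypothetical rows that mimic the JaCoCo columns used above (values are made up)
raw_demo = pd.DataFrame({
    "PACKAGE": ["org.example.web", "org.example.model"],
    "CLASS": ["OwnerController", "Owner"],
    "LINE_MISSED": [6, 0],
    "LINE_COVERED": [18, 12],
})

demo = pd.DataFrame(index=raw_demo.index)
# Fully qualified class name
demo["class"] = raw_demo["PACKAGE"] + "." + raw_demo["CLASS"]
# Total number of lines = missed + covered
demo["lines"] = raw_demo["LINE_MISSED"] + raw_demo["LINE_COVERED"]
# Ratio of covered lines, between 0 and 1
demo["coverage"] = raw_demo["LINE_COVERED"] / demo["lines"]
```

Note that a class with zero lines would end up with a NaN coverage, because 0 / 0 is undefined.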
In [3]:
from IPython.display import display, HTML
base_html = """
<!DOCTYPE html>
<html>
<head>
<script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
<script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
</head>
<body>
<figure>
{rendered_chart}
</figure>
</body>
</html>
"""
The core idea is to let pandas create the data in a format that pygal's visualizations can consume easily. So let's have a look at what pygal expects as input data.
Here is a basic example of a bar chart (adapted from pygal's documentation). Take a look at the visualization (hint: it's interactive!).
In [4]:
import pygal
bar_chart = pygal.Bar(height=200)
bar_chart.title = 'Browser usage evolution (in %)'
bar_chart.x_labels = map(str, range(2002, 2013))
bar_chart.add('Firefox', [None, None, 0, 16.6, 25, 31, 36.4, 45.5, 46.3, 42.8, 37.1])
bar_chart.add('Chrome', [None, None, None, None, None, None, 0, 3.9, 10.8, 23.8, 35.3])
bar_chart.add('IE', [85.8, 84.6, 84.7, 74.5, 66, 58.6, 54.7, 44.8, 36.2, 26.6, 20.1])
bar_chart.add('Others', [14.2, 15.4, 15.3, 8.9, 9, 10.4, 8.9, 5.8, 6.7, 6.8, 7.5])
display(HTML(base_html.format(rendered_chart=bar_chart.render(is_unicode=True))))
One of the important lines is this one:
bar_chart.add('Firefox', [None, None, 0, 16.6, 25, 31, 36.4, 45.5, 46.3, 42.8, 37.1])
For each bar chart category (like "Firefox" or "Chrome"), we need to call the add function and provide the data.
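If the numbers already live in a pandas DataFrame, the same shape (a title plus a list of values) can be produced with a small loop. This is only a sketch: the usage frame and its values are made up, and the NaN-to-None step assumes that gaps should be handed over as None, like in the example above.

```python
import pandas as pd

# Hypothetical wide table: one column per category, one row per x label
usage = pd.DataFrame(
    {"Firefox": [None, 16.6, 25.0], "IE": [85.8, 74.5, 66.0]},
    index=["2002", "2005", "2006"],
)

# DataFrame.items() yields (column name, column Series) pairs.
# pandas stores the missing value as NaN, so we turn it back into None.
series_for_pygal = [
    (name, [None if pd.isna(v) else float(v) for v in column])
    for name, column in usage.items()
]
```

Each (title, values) pair in series_for_pygal could then be passed to the add function of a chart.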
Let's go back to our own dataset. First, we create a category that makes some kind of sense for our use case. Let's use the name of a technical aspect of a source code file as our category. We can find this information in the second-to-last part of the class column (at least in most cases).
In [5]:
df['category'] = df['class'].str.split(".").str[-2]
df.head()
Out[5]:
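Here is a quick sketch of what the split does, using hypothetical class names. The last entry has no package at all, which is one of the cases where this approach doesn't give us a category.

```python
import pandas as pd

# Hypothetical fully qualified class names
classes = pd.Series([
    "org.example.web.OwnerController",
    "org.example.model.Owner",
    "Standalone",  # no package, so there is no second-to-last part
])

# Split at the dots and take the second-to-last part of each name
categories = classes.str.split(".").str[-2]
```

For names with fewer than two parts, pandas returns a missing value instead of raising an error, so it can be worth checking the category column for missing values afterwards.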
In [6]:
mean_by_category = df.groupby('category')['coverage'].mean()
mean_by_category
Out[6]:
We just iterate over all entries and add these to the bar chart by using a list comprehension.
In [7]:
bar_chart = pygal.Bar(height=200)
[bar_chart.add(x[0], x[1]) for x in mean_by_category.items()]
display(HTML(base_html.format(rendered_chart=bar_chart.render(is_unicode=True))))
So this is pretty standard and easy to do.
Let's look at a slightly more sophisticated use case: showing coverage values for all classes and coloring the classes according to the category they belong to.
For this, a bar chart doesn't make sense anymore. So let's look at another visualization type.
A treemap generates size-based tiles from a dataset and arranges them nicely.
The key idea to integrate pandas with pygal is to use pandas' groupby function to get the data in a format that pygal can consume. The special trick is to put all the lines values into a list for each category.
In [8]:
values_by_category = df.groupby(['category'])['lines'].apply(list)
values_by_category
Out[8]:
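On a tiny, made-up dataframe, the result of this grouping looks like the following sketch (the demo frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data: three classes in two categories
demo = pd.DataFrame({
    "category": ["web", "model", "model"],
    "lines": [24, 12, 30],
})

# Collect all "lines" values of a category into one list per category
lines_by_category = demo.groupby("category")["lines"].apply(list)
```

The result is a Series with the category as index and a plain Python list as value for each entry.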
This format is exactly what pygal needs. Let's create the treemap out of this data by using a list comprehension again.
In [9]:
treemap = pygal.Treemap(height=200)
[treemap.add(x[0], x[1]) for x in values_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))
You might have noticed that the labels on mouse-over actions don't show the actual class name but rather the name of the category. Instead of passing a list of values, we need to differentiate between the actual value and the corresponding label for each value. We can do this by passing an appropriate dictionary.
chart.add('category', [{'value' : 1, 'label': 'one'}, {'value': 2, 'label': 'two'}])
Let's fix this with another trick: we can iterate over the necessary data during the grouping of the values. For this, we have to combine the data that we need with the zip function and build the data dictionaries within the apply action.
In [10]:
class_values_by_category = df.groupby(['category']).apply(
    lambda x : [{"value" : l, "label" : c } for l, c in zip(x['lines'], x['class'])])
class_values_by_category
Out[10]:
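If the apply call feels a bit dense, the same structure can also be built by iterating over the groups directly. This is an alternative sketch, not the approach used in this post; the demo frame and its values are made up.

```python
import pandas as pd

# Hypothetical data: three classes in two categories
demo = pd.DataFrame({
    "category": ["web", "model", "model"],
    "class": [
        "org.example.web.OwnerController",
        "org.example.model.Owner",
        "org.example.model.Pet",
    ],
    "lines": [24, 12, 30],
})

# Iterating over a groupby object yields (group key, sub-dataframe) pairs.
# For each group, we build one {"value": ..., "label": ...} dict per class.
points_by_category = {
    category: [
        {"value": lines, "label": name}
        for lines, name in zip(group["lines"], group["class"])
    ]
    for category, group in demo.groupby("category")
}
```

The resulting dictionary can be iterated with .items() in the same way as the grouped Series.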
If we generate the treemap once again, you can spot the difference in the visualization by hovering over the tiles with your pointing device.
In [11]:
treemap = pygal.Treemap(height=200)
[treemap.add(x[0], x[1]) for x in class_values_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))
In the final step, I want to show you how you can colorize the tiles as needed. In our case, the column coverage is a perfect candidate for this, because it shows the ratio of executed code lines. A value near 1 means that almost all code lines were executed. A value near 0 means that almost no code lines ran.
Let's see if we can visualize this in the treemap, too. For this, we need two things: an indicator value for each class (the coverage column) and a color for each indicator value.
There are many ways to do it, but the most basic way is to assign every indicator value a corresponding color. For this, we'll use the blue-to-red coolwarm colormap from matplotlib and pick the colors accordingly.
In [12]:
from matplotlib.cm import coolwarm
from matplotlib.colors import rgb2hex
df['color'] = df['coverage'].apply(lambda x : rgb2hex(coolwarm(x)))
df.head()
Out[12]:
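If you're wondering what such a colormap does, the basic idea can be sketched in plain Python. This is a simplified stand-in, not matplotlib's actual coolwarm implementation: the function name value_to_hex, the two endpoint colors and the gray for missing values are assumptions for illustration, and the function simply interpolates linearly between the endpoints.

```python
import math


def value_to_hex(value, low=(59, 76, 192), high=(180, 4, 38), missing="#cccccc"):
    """Map a value between 0 and 1 to a hex color between `low` and `high`."""
    if value is None or math.isnan(value):
        # No coverage information available: fall back to a neutral color
        return missing
    # Clamp to the range [0, 1] so out-of-range values don't break the color
    value = min(1.0, max(0.0, value))
    # Interpolate each RGB channel separately
    channels = [round(lo + (hi - lo) * value) for lo, hi in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


colors = [value_to_hex(v) for v in (0.0, 0.5, 1.0)]
```

A real colormap like coolwarm uses a carefully chosen color progression instead of a straight line between two colors, which is why I prefer the matplotlib version here.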
In [13]:
class_ratios_by_category = df.groupby(['category']).apply(
    lambda x : [
        {"value" : y,
         "label" : z,
         "color" : c} for y, z, c in zip(
            x['lines'],
            x['class'],
            x['color'])])
class_ratios_by_category
Out[13]:
Let's plot this treemap. We disable the legend, because it doesn't make sense anymore (the colors of the legend don't represent the colors in the treemap anymore).
In [14]:
treemap = pygal.Treemap(height=200, show_legend=False)
[treemap.add(x[0], x[1]) for x in class_ratios_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))
One problem exists, though: the value in the lower left corner of the tooltip is the lines value. If we want to display another value there (e.g. the coverage value), we need to hack the system a little bit by introducing a value formatter. This formatter needs a formatting function that we can happily provide (but surely not in the way the library designers originally intended ;-) ).
In [15]:
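class_ratios_hack_by_category = df.groupby(['category']).apply(
    lambda x : [
        {"value" : y,
         "label" : z,
         "color" : c,
         # f=f binds the current coverage value to this formatter; without it,
         # every formatter of a group would show the group's last coverage value
         "formatter" : lambda _value, f=f : "{0:.0%}".format(f)} for y, z, c, f in zip(
            x['lines'],
            x['class'],
            x['color'],
            x['coverage'])])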
class_ratios_hack_by_category
Out[15]:
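One general Python detail matters when functions are created inside a comprehension, as with the formatter here: a lambda looks up the loop variable when it is called, not when it is defined. The following sketch (with made-up coverage values) shows the difference between a late-bound lambda and one that binds the current value via a default argument.

```python
# Hypothetical coverage values
values = [0.25, 0.5, 1.0]

# Late binding: all lambdas share the loop variable f,
# which holds the last value once the comprehension has finished
late = [lambda _value: "{0:.0%}".format(f) for f in values]

# Early binding: the default argument f=f is evaluated when each lambda
# is created, so every lambda keeps its own value
bound = [lambda _value, f=f: "{0:.0%}".format(f) for f in values]

late_labels = [formatter(None) for formatter in late]
bound_labels = [formatter(None) for formatter in bound]
```

Here, late_labels contains '100%' three times, while bound_labels contains '25%', '50%' and '100%'.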
In [16]:
treemap = pygal.Treemap(height=200, show_legend=False, colors=["#ffffff"])
[treemap.add(x[0], x[1]) for x in class_ratios_hack_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))
There are many other visualization types that you can use with these tricks. Let's go back to the mean coverage per category from the beginning.
In [17]:
mean_by_category
Out[17]:
We can visualize this, e.g., as a gauge chart.
In [18]:
gauge = pygal.SolidGauge(inner_radius=0.70)
[gauge.add(x[0], [{"value" : x[1] * 100}] ) for x in mean_by_category.items()]
display(HTML(base_html.format(rendered_chart=gauge.render(is_unicode=True))))
Or in another variant of it...
In [19]:
gauge = pygal.Gauge(human_readable=True)
[gauge.add(x[0], [{"value" : x[1] * 100}] ) for x in mean_by_category.items()]
display(HTML(base_html.format(rendered_chart=gauge.render(is_unicode=True))))
OK, STOP! Enough for today!
All right, that's it for this blog post! I hope you have seen that (if you know some tricks) you can easily integrate pandas with pygal!
I find this combination a nice tradeoff between complexity and interactivity. Let me know if I can simplify something or explain one or two things in more depth.
Maybe next time, we can take a look at some tricks regarding D3, can't we?